02 March, 2009

Data mining for terrorists

At the weekend Ben Goldacre wrote an article about the use (or uselessness) of data mining for national security. It is on his blog here. He attacks the idea of data mining for terrorists by considering the number of false positives that would be produced, given certain assumptions about the sensitivity and specificity of the test. I am a great fan of Ben's, but he is overly fond of applying models from medical statistics to situations where they don't easily fit.


Data mining can mean many things and can be used to address many different problems. Ben's article addresses just one scenario: studying patterns to identify potential terrorists in the UK. You might call this the credit card scenario. It is a technique that works well for detecting credit card fraud but, for all sorts of good reasons, is unlikely to work for detecting terrorists. Ben links to an article by Bruce Schneier that explains why. It is true that pattern matching across the whole population is very likely to produce an overwhelming number of false positives, but I can't believe that the people working on these things haven't realized that. They don't really have to do anything very sophisticated. They just have to ask themselves – how many people in the UK are going to match this pattern anyway?

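The arithmetic behind the false positive objection is short enough to sketch. The figures below are my own made-up assumptions, not Ben's: a round 60 million for the UK population, a hypothetical 1,000 real targets, and a test that is 99% sensitive and 99% specific. The function name is likewise just for illustration.

```python
def screening_outcomes(population, true_targets, sensitivity, specificity):
    """Return (true_positives, false_positives, ppv) for one screening pass.

    ppv is the positive predictive value: the share of flagged people
    who really are targets.
    """
    innocents = population - true_targets
    true_positives = sensitivity * true_targets
    false_positives = (1.0 - specificity) * innocents
    flagged = true_positives + false_positives
    ppv = true_positives / flagged if flagged else 0.0
    return true_positives, false_positives, ppv


# Hypothetical inputs, chosen only to show the shape of the result.
tp, fp, ppv = screening_outcomes(
    population=60_000_000,
    true_targets=1_000,
    sensitivity=0.99,
    specificity=0.99,
)
print(f"true positives:  {tp:,.0f}")
print(f"false positives: {fp:,.0f}")
print(f"chance a flagged person is a real target: {ppv:.2%}")
```

With these assumed numbers the test flags roughly 600,000 innocent people in order to catch about 990 real targets, which is the point Ben is making about screening a whole population.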

Ben also links to an online book by the National Academies Press which identifies two types of data mining (there are others) - subject-based data mining and pattern recognition. Subject-based data mining is little more than the speeding up of normal methods of investigation. There is an incident, individual, group or potential target, and the security forces need to investigate a wide variety of links. There is little serious doubt about the value of this method. It is just an extension of what the police national computer is already used for. It seems very plausible that the security forces would be able to do this even more effectively, and in a wider variety of situations, if they had more information about the UK population online. Considerations of specificity and sensitivity don't come into it.


Pattern-based data mining is closer to the credit card scenario. If it is used crudely, in isolation from other sources of information, to discover potential terrorists in the whole UK population, then Ben's calculation becomes relevant and the approach seems wildly implausible. But several things could make it a plausibly useful tool. Perhaps the two most important are:


The problem to be addressed might be different.


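Security forces may be trying to decide whether to raise the national security alert level because there are signs of terrorist activity, even though they don't know who the terrorists are. That is a question about whether something is going on, not about which individuals to pick out.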


Other information may change the situation hugely.


To see how other information can change things, go back to the credit card fraud scenario. We know this works, but it does create a large number of false positives – I suspect most credit card owners have had a call at some time or another because their pattern of spending has been unusual. But imagine that the police believe someone in Leeds has recently been using stolen credit cards to buy high-value electrical goods for resale. Now the search is confined to a much smaller set of transactions, and the potential of pattern matching is increased enormously. Something similar would apply to the security forces scenario.

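To put rough numbers on how much extra information can matter, here is a small sketch using Bayes' rule. Everything in it is an assumption of mine for illustration: the same hypothetical 99% sensitive and 99% specific test, applied first to a round 60 million people containing 1,000 targets, and then to an imagined group of 5,000 people singled out by other intelligence, 50 of whom are targets.

```python
def positive_predictive_value(prior, sensitivity, specificity):
    """Bayes' rule: probability that a flagged person is a real target."""
    hit = sensitivity * prior
    false_alarm = (1.0 - specificity) * (1.0 - prior)
    return hit / (hit + false_alarm)


# Hypothetical base rates: whole population versus a group already
# narrowed down by other information (a place, a time, a known link).
whole_population_prior = 1_000 / 60_000_000
narrowed_prior = 50 / 5_000

ppv_whole = positive_predictive_value(whole_population_prior, 0.99, 0.99)
ppv_narrow = positive_predictive_value(narrowed_prior, 0.99, 0.99)

print(f"whole population: {ppv_whole:.2%} of flags are real")
print(f"narrowed group:   {ppv_narrow:.2%} of flags are real")
```

Under these assumptions the very same test goes from being right about a fraction of a percent of the time to being right about half the time. The test has not improved at all; only the starting population has changed.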

In the end this is a matter of whether it is worth the financial and privacy costs, and that is a very difficult question when the benefits necessarily have to be described rather vaguely. But I don't see that the cancer screening model adds much to understanding the benefits.